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Estimating errors is a crucial part of any scientific analysis. Whenever a parameter is 
estimated (model-based or not), an error estimate is necessary. Any parameter estimate 
that is given without an error estimate is meaningless. Nevertheless, many (undergraduate 
or graduate) students have to teach such methods for error estimation to themselves when 
working scientifically for the first time. This manuscript presents an easy-to-understand 
overview of different methods for error estimation that are applicable to both model-based 
and model-independent parameter estimates. These methods are not discussed in detail, 
but their basics are briefly outlined and their assumptions carefully noted. In particular, 
the methods for error estimation discussed are grid search, varying x 2 , the Fisher matrix, 
Monte-Carlo methods, error propagation, data resampling, and bootstrapping. Finally, a 
method is outlined how to propagate measurement errors through complex data-reduction 
pipelines. 



1 Introduction 

This manuscript is intended as a guide to error estimation for parameter estimates in astron- 
omy. I try to explain several different approaches to this problem, where the emphasis is on 
highlighting the diversity of approaches and their individual assumptions. Making those as- 
sumptions explicitly clear is one of the major objectives, because using a certain method in a 
situation where its assumptions are not satisfied will result in incorrect error estimates. As this 
manuscript is just an overview, the list of methods presented is by no means complete. 

A typical task in scientific research is to make measurements of certain data and then to 
draw inferences from them. Usually, the inference is not drawn directly from the data but 
rather from one or more parameters that are estimated from the data. Here are two examples: 

• Apparent magnitude of stars or galaxies. Based on a photometric image, we need to 
estimate the parameter "flux" of the desired object, before we can infer its apparent 
magnitude. 

• Radial velocity of stars. First, we need to take a spectrum of the star and identify appro- 
priate emission/absorption lines. We can then estimate the parameter "radial velocity" 
from fitting these spectral lines. 

Whenever such parameter estimates are involved, it is also crucial to estimate the error of the 
resulting parameter. 

What does a parameter estimate and its error actually signify? More simply, what is the 
meaning of an expression such as 4.3 ± 0.7? This question will be answered in detail in Sects. 



2.3 and 2J3 but we want to give a preliminary answer here for the sake of motivation. The 
crucial point is that the true result of some parameter estimate is not something like 4.3 ± 0.7, 
but rather a probability distribution for all possible values of this parameter. An expression 
like 4.3 ± 0.7 is nothing more than an attempt to encode the information contained in this 
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probability distribution in a more simple way, where the details of this "encoding" are given 



by some general standards (cf. Sect. 2.5). Put simply, the value 4.3 signifies the maximum of 
the probability distribution (the most likely value), whereas the "error" 0.7 signifies the width 
of the distribution. Hence, the value 4.3 alone contains insufficient information, since it does 
not enable us to reconstruct the probability distribution (the true result). More drastically: 
A parameter value without a corresponding error estimate is meaningless. Therefore, error 
estimation is equally as important an ingredient in scientific work as parameter estimation itself. 
Unfortunately, a profound and compulsory statistical eduction is missing in many university 
curricula. Consequently, when (undergraduate or graduate) students are faced with these 
problems for the first time during their research, they need to teach it to themselves. This is 
often not very efficient and usually the student focuses on a certain method but does not gain 
a broader overview. The author's motivation was to support this process by providing such an 
overview. 

Where do uncertainties stem from? Of course, an important source of uncertainties is the 
measured dataset itself, but models can also give rise to uncertainties. Some general origins of 
uncertainties are: 

• Random errors during the measurement process. 

• Systematic errors during the measurement process. 

• Systematic errors introduced by a model. 

We obviously have to differentiate between random and systematic errors, i.e., between vari- 
ance/scatter and bias/offset. Systematic errors are usually very hard to identify and to correct 
for. However, this is not part of this manuscript, since individual solutions strongly depend on 
the specific problem. Here, different methods of estimating random errors (variance/scatter) 
are considered, i.e., those quantities that determine the size of error bars or, more generally, 
error contours. Error estimation for parameter estimation only is described, whereas error 
estimation for classification problems is not discussed. 

This manuscript will not be submitted to any journal for two reasons: First, its content 
is not genuinely new but a compilation of existing methods. Second, its subject is statistical 
methodology rather than astronomy. Any comments that may improve this manuscript are 
explicitly welcome. 



2 Preliminaries 

Before diving into the different methods for error estimation, some preliminaries should be 
briefly discussed, firstly, the terminology used throughout this manuscript and, secondly, errors 
of data. Thirdly, the basics of parameter estimation are briefly explained, including the intro- 
duction of the concept of a likelihood function. Fourthly, the central-limit theorem is discussed. 
Finally, the concept of confidence intervals, which are the desired error estimates, is introduced. 



2.1 Terminology 

This manuscript is about "error estimation for parameter estimates" . The first step is usually 
to measure some data and also to measure its error or uncertainty (Sect. 2.2). Given this 
measurement, the task is then to estimate some parameter (Sect. 2.3). The estimate of the 
parameter 9 is denoted by a hat, 9, which is common practice in statistics. Here, I want 
to introduce the concept of a qualitative difference between "measuring" and "estimating": 
Measurements are outcomes of a real experiment. Conversely, estimates are inferences from 
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Figure 1: Examples of Poisson distribu- 
tions with \i = 1, \i = 2, ii = 4, and /i = 8. 



Figure 2: Poisson distribution with fi = 10 
and Gaussian distribution with a 2 = /i = 
10. The Gaussian is a very good approxi- 
mation to the Poisson distribution. 



measurements, i.e., they are not directly related to experiments. Although this difference is 
not of vital importance, both terms are rigorously differentiated throughout this manuscript in 
order to make clear what is being referred to. 

Another issue of terminology concerns the words "error" and "uncertainty" . As mentioned 
in the introduction, systematic errors are not considered here, and hence both terms may 
be used more or less synonymously]]] There is also a third word of the same family, namely 
"noise", which could also be used synonymously. However, personally I would use the word 
"noise" only in the context of measurements, whereas in the context of parameter estimates 
the word "uncertainty" appears to be most natural. 



2.2 Errors of data 

Real data is always subject to noise, i.e., there is always some uncertainty]^] The precise origin 
of the noise, e.g., read-out noise, sky noise, etc., is not relevant for this manuscript. Most 
methods of error estimation for parameters require knowledge about the error distribution (or 
noise distribution) of the measured data. Random scatter (variance, noise) determines the 
width of the error distribution, whereas the mean of the error distribution is determined by the 
physical signal which may be shifted by a systematic error (bias). The easiest way to measure 
the data's error distribution is to repeat an identical measurement process and to monitor 
the distribution of the results. However, this is often infeasible, e.g., because a measured 
event happens rarely or because the measurement process itself is expensive (in time, money, 
computational effort, etc.). 

Fortunately, for many measurement processes in astronomy the mathematical type of the 
error distribution is known from physical arguments, e.g., whenever spectroscopy or photometry 
are carried out. In this case, the physical measurement process is counting photons in a certain 
pixel. This photon counting is usually assumed to be a Poisson process, i.e., it follows the 
Poisson distribution. Given a mean value of photons in a certain pixel, the probability of 



Hogg et al. (2010) argue that the word "error" is not a good terminology anyway and that it should be 
replaced by "uncertainty" . However, their argument assumes that systematic errors (biases) can always be 
corrected. 

2 One can think of real data as being one of many different "noise realisations" of the truth, but not being 
the truth itself. 



3 



Rene Andrae (2010) - Error estimation in astronomy: A guide 



measuring n photons in this pixel is given by 
/x n 

prob(?x|/x) = ~~j~ e ^ ■ (1) 

Figure [T] shows examples of Poisson distributions with different mean values. In practice, the 
mean /i is usually unknown and the task is to estimate the value of /x given n. We emphasise 
that the mean of the Poisson distribution does not coincide with the maximum position (the 
mode), since the Poisson distribution is not symmetric. The fact that the Poisson distribution 
is defined for positive integers only already implies that it cannot be symmetric. 

Modern instruments have high sensitivities and it is also common practice (if possible) to 
choose the exposure time in the measurement such that the number of photon counts per pixel 
is large. The obvious reason to do this is to improve the signal-to-noise ratio. However, there is 
also another benefit: If the number of photon counts is large, the expected value of /x will also 
be large. In the case of large /x, the Poisson distribution approaches the Gaussian distribution, 



prob(ra|/i, a) 



V2 



cxp 



7T(T Z 



1 (n - fi) 2 
2 



(2) 



which has much more convenient analytical properties than the Poisson distribution, as we 
shall see in Sections 2^3 and 3^3 Figure [2] shows a Poisson distribution with mean /x = 10 



and a Gaussian with mean /x = 10 and variance a — xx = 10. Evidently, for a mean of only 
ten photon counts per pixel, the actual Poisson distribution can be nicely approximated by a 
Gaussian already. This is usually a valid approximation in the optical regime and at larger 
wavelengths, whereas in the high-energy regime (UV, X-ray, gamma) it is not unheard of that 
there are less than ten photon counts per pixel. 



2.3 Parameter estimation 

I now briefly discuss the concept of parameter estimation. Here, the focus is on model-based 
estimates (e.g. estimating a mean), because they all have a common concept, whereas model- 
independent parameter estimation (e.g. estimating image flux) is a very diverse field and usually 
self-explanatory. 

The common feature of model-based parameter estimation is that it is usually an optimi- 
sation (or "fitting") problem. The task at hand is to optimise a certain score function, e.g., to 
minimise residuals or - more technically - to maximise the likelihood function^ For a given set 
of measured data D and a given model M with parameters 9, the likelihood function is defined 
via the noise model, 

C(D; M, 9) = probplM, 9) . (3) 

In words, the likelihood is the probability of measuring the data D we have, given the model M 
with parameters 9^\ Maximum-likelihood estimation then means that we choose the parameter 
values such that the data we did measure were the most likely measurement outcome]^] The set 

3 Bayesians prefer to maximise the posterior probability instead. I will restrict the discussion on likelihood 
functions alone, since Bayesians should be happy with that assuming flat priors, whereas discussing posterior 
probabilities would probably enrage some frequentists. 

4 Quoting Heavens (2009): If you are confused by the conditional probabilities prob(A|i?) of A given B and 



prob(i?|^4) of B given A consider if A=pregnant and i3=female. Then prob(A|£?) is a few percent whereas 
prob(_B|^4) is unity. 

5 Note the subtle fact that this is actually not the question we are asking. We actually want to know what are 
the most likely parameters given the data and model. Hence, we should actually maximise prob(#|M, D) rather 



than prob(Z?|M, 9). However, this would lead us to Bayes' theorem and the issue of priors (e.g. see Barlow 



1993 ) which are the cause of the long-ongoing Bayesian vs. frequentist discussion in the scientific community. 
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Table 1: Data sample used as a standard example for all methods. All data points x n are 
sampled from a Poisson distribution with mean \i = 10 (cf. Fig. [2]). The columns entitled 
"error" give the Gaussian standard deviations a n for each data point x n for the cases where 
the error distribution is assumed to be Gaussian. 



of parameters that maximise the likelihood function is called the maximum-likelihood estimate 
and it is usually denoted by 6. 

A somewhat philosopical note: Actually, the likelihood function as given by Eq. (|3| is what 
every parameter estimation is aiming for. This function, £(D;M, 6), contains all important 
information about the data and the model, a theorem which is called likelihood principle. 
However, Eq. ^ is just an abstract definition and even a more specific example (e.g. Eqs. ^ 
and (|5])) usually does not provide more insight. Therefore, one has to extract the information 
from Eq. ^ in some way. If the model under consideration has only one or two model 
parameters, it is possible to plot the likelihood function directly (e.g. Fig. [8]), without involving 
any optimisation procedure. Although such a plot is actually the final result of the parameter- 
estimation process, people (including myself) are usually happier giving "numbers" . Moreover, 
if a model has more than two parameters, the likelihood function cannot be plotted anymore. 
Hence, the standard practise of encoding the information contained in the likelihood function 
is by identifying the point in parameter space where the likelihood function takes its maximum 
(the maximum-likelihood estimate) plus inferring the "width" of the function at its maximum 
(the uncertainty). If nothing contrary is said, these two quantities usually signify the mean 
value and the standard deviation of a Gaussian. Consequently, if both values are provided, one 
can reconstruct the full likelihood function. 

In order to "give some flesh" to the rather abstract concept of a likelihood function, two 
simple examples of parameter estimation are now discussed. This allows us to see this concept 
and the Poisson and Gaussian distributions "in action". Table[T]also introduces the data sample 
that will be used to demonstrate every method that is discussed using actual numbers. 



2.3.1 Example 1: Estimating the mean of a Gaussian distribution 



The first example is to estimate the mean fi of a sample of N measured data points D = 
{x\, X2, ■ ■ ■ , xn} that are all real values. The assumption is that all data points have a Gaussian 



error distribution, i.e., 



prob(x n |yU,(J r , 



exp 



1 (x n - pi) 2 
2 



o: 



(4) 



where we assume that each measurement x n has its own standard deviation a n . The probability 
of all N measurements D = {x±, X2, ■ ■ ■ ,xn} ~ the likelihood function - is (in the case of mi- 
correlated measurements) just the product of the probabilities of the individual measurements, 



i.e., 



N 



£(D;fi) = JJprob(x n |/i,a n ) 



(5) 



n=l 
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There are two reasons for maximising log£ instead of C. First, log£ sometimes takes a much 
more convenient mathematical form, enabling us to solve the maximisation problem analyti- 
cally, as we shall see immediately. Second, £ is a product of N potentially small numbers. If 
N is large this can cause a numerical underflow in the computer. As the logarithm is a strictly 
monotonic function, the maxima of C and log£ will be identical. The logarithmic likelihood 
function is given by 

N 

log £(D;//) = ^logprob(x n |/i,cr n ) . (6) 

n=l 

Inserting Eq. Q yields 



n=l 



it 



where C encompasses all terms that do not depend on \x and are therefore constants during 
the maximisation problem. We can now identify the sum, 

x 2 = f;^hr^> (8) 

n=l ° n 

such that log£(.D; /i) = — \x 2 + C. In other words, maximising the likelihood function in the 
case of Gaussian noise is equivalent to minimising x 2 J3 ^ n or der to estimate /i, we now take the 
first derivative of Eq. §7§ or Eq. (J8J) with respect to /x, set it equal to zero and try to solve the 
resulting equation for \i. The first derivative of Eq. ^ set to zero then reads 

dlogCjD; fj) = (x n - fi) 
dl* hi < 

and solving for \x yields the maximum-likelihood estimate 

A = f^4- do) 

This estimator is a weighted mean which underweights data points with large measurement 
errors, i.e., data points that are very uncertain. For the example data set of Table [T] we get 
fx ~ 10.09. This result can be simplified by assuming that all data points have identical standard 
deviations, i.e., a n = a for all x n . Our result then reads 

£ = ^f^, (ii) 

which is simply the arithmetic mean. For the example data set of Table[l]we then get fi ~ 10.07. 



The derivation of the corresponding error estimation of fi is postponed to Sect. 3.3 



6 Actually, we should say that minimising x 2 provides the correct estimator if and only if the error distribution 
of the data is Gaussian. If the error distribution is not Gaussian, then minimising squared residuals may well 
be plausible but it is not justified. 
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2.3.2 Example 2: Estimating the mean of a Poisson distribution 

The second example is precisely the same task, but now the error distribution of the iV mea- 
surements D = {xi, X2, ■ ■ ■ , xn} (which are now all integers) should be a Poisson distribution 
as given by Eq. ([I]). Again, we estimate the mean \i by maximising the likelihood function of 
the data 

TV 

C(D ]f i) = JJprob(x n |/x), (12) 

n=l 

or rather the logarithmic likelihood function 

log C(D; fi) = J2 l^-re-n = -N // + log(^) ]T x n + C , (13) 

n=l x ' n=l 

where C again summarises all terms that do not depend on fi. Taking the first derivative with 
respect to \i and setting it to zero yields 

dlog£(D;^) = _ iV+ l^ 

n=l 

Solving for /x then yields the maximum-likelihood estimate 

N 

X 



/' z x r J2 x n, (15) 



n=l 

which is the arithmetic mean, again. Do not be mistaken by the fact that the result was identical 
for the Gaussian and the Poisson distribution in this example. In general, trying to estimate a 
certain quantity assuming different error distributions also results in different estimators. 

2.3.3 Example 3: Estimating a fraction 

Finally, we also want to consider a third example which will turn out to be less well behaved 
than the first two. Let us consider the situation that we are given a set of iV objects, say 
galaxies, of which n are in some way different from the other N — n objects, say they are active 
galaxies. The task is to estimate the fraction / of special objects from these numbers. The 
natural assumption here is that the likelihood function is a binomial distribution, i.e., 

71 1 

C(D; f) = C{n, N] f) = ^ - ^ /"(I - . (16) 

The desired fraction is limited to the interval / £ [0,1]. What value of / maximises the 
likelihood of the measured values of iV and n? Let us first compute the logarithmic likelihood, 

log C(D- f) —n log f + (N — n) log(l -f)+C, (17) 

where C contains everything that does not depend on /. The first derivative of log£ w.r.t. / 
reads, 

Slog £(£>;/) _ n N-n 

df (18) 

Equating this to zero and solving for / yields the maximum-likelihood estimator 

< 19 > 
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provided that neither n nor / nor N are zeroQ We will consider this example in two flavours: 



1. N = 10 and n — 0, i.e., the given sample is very small and contains no special objects. 
An error estimate enables us to assess whether this rules out the existence of these special 
objects. 

2. N = 30 and n = 4. 

Figure [3] shows the binomial likelihood functions for both cases. An example where a binomial 
distribution shows up in astronomy can, e.g., be found in Cisternas et al. (2010). 



2.3.4 More general error distributions 

The task of model-based parameter estimation 
as outlined in the previous subsections can be 
summarised as follows: 

1. Identify the error distribution of the data 
and write down the (logarithmic) likeli- 
hood function. 

2. In order to maximise the (logarithmic) 
likelihood function, take the first deriva- 
tive^) with respect to the desired model 
parameter(s) and set it equal to zero. 

3. Try to solve the resulting (system of) equa- 
tion^) for the desired fit parameter (s). 

The first two steps are usually not that hard. In 
the two examples given above, step 3 was also 
fairly simple because our model was simple (our 
model was a constant fi) and the error distri- 
butions were well-behaved. However, in general 
it is not possible to perform step 3 analytically 
for more complex models or other error distri- 
butions. In this case the optimisation has to be 
done numerically. 




W=10, n=0 
2V=30, n =4 



0.6 



0.8 



1.0 



Figure 3: Binomial likelihood functions for 
the fraction / e [0, 1] for example 3. For the 
sake of visualisation the likelihood functions 
are not normalised, which is why the ?/-axis 
is unlabelled. 



MacKay (2003) provides an overview of the most important distribution 



functions that may - in principle - appear as error distributions. 

Nonetheless, we should not be mistaken by the simplicity of the previous examples. Replac- 
ing the error distribution measured in some experiment by an analytic distribution function, 
such as a Poisson or Gaussian, is merely a makeshift. We have to keep in mind that it may 
happen in practice that for some measured error distribution it may not be possible to find 
such an analytic parametrisation. In such cases, things get substantially more difficult as we 
can no longer write down the likelihood function. Even so, we could still work with the whole 
unparametrised error distribution. 



7 This result should not come as a suprise. That the fraction is estimated via n/N would have been a plausible 
guess, but mind that we now have seen why it is justified. 
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2.4 Central-limit theorem 



The central-limit theorem is a key result of statistics and shows up repeatedly throughout 
this manuscript. In simple words, the central-limit theorem tells us that if certain regularity 
conditions are met, any likelihood function is asymptotically Gaussian near its maximum. Let 
us denote log£ = log C(9) as a function of the P parameters 9 = {9i, . . . ,9p} and Taylor 
expand it around the maximum # max to second order, 



log£(0)«log£(0 E 



+ 



1 <9 2 log£ 



2 09 l d9 j 




Figure 4: Possible failures of the central-limit 
theorem. This figure shows an example like- 
lihood function £ as a function of some pa- 
rameter 9 (solid blue curves), its Gaussian 
approximation at the maximum (dashed red 
curves), and the widths of these Gaussians 
(horizontal solid black lines). In panel (a) the 
Gaussian approximation is good and the er- 
ror estimate reliable. In panel (b) the Gaus- 
sian approximation is rather poor and the 
error is substantially underestimated. 



0„ 



9 



max J j 



(20) 



The linear term vanishes at the maximum, be- 
cause the first derivative evaluated at 9 m3uX be- 
comes zero. Consequently, \ogC(9) is approxi- 
mately a quadratic form in 9 near the optimum, 
i.e., C(9) = e log£( ^ is a Gaussian. The goodness 
of the Gaussian approximation depends on the 
specific situation, i.e., on the measured data, its 
noise, and the model used. The approximation is 
good, if we are close enough. However, for error 
estimation we cannot go arbitrarily close to the 
maximum of the likelihood function. In fact, the 
Gaussian approximation can be arbitrarily poor 
in practice. Figure [4] shows an example where 
this is the case. The central-limit theorem may 
be particularly problematic in case of parameters 
that are not defined on the interval (— oo, oo) of 
a Gaussian but, e.g., are constrained on the in- 
terval [0, 1] such as the fraction of example 3. 
Moreover, there are non-pathologic cases where 
the likelihood function never becomes Gaussian 
at its maximum. For instance, consider exam- 
ple 3.1 and Fig. [3j where the likelihood function 
even does not have a real maximum where its 
first derivative would be zero. In this particular 
case, the Taylor expansion of Eq. (20) breaks 



down, because we cannot compute any deriva- 
tives at the "maximum" at all. Consequently, whenever invoking the central-limit theorem, 
one should carefully check the goodness of the Gaussian approximation. 



2.5 Confidence intervals 

In order to explain methods for error estimation, confidence intervals have to be considered. 
We are all familiar with the concept of error bars, error ellipses or - in general - error contours. 
They are different manifestations of confidence intervals. These confidence intervals are inferred 
from the likelihood function. 



2.5.1 A single parameter 

Initially, confidence intervals for a single parameter are considered, in order to keep things sim- 
ple. Figure [5] shows the most simple example - confidence intervals of a Gaussian distribution. 
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Figure 5: Confidence intervals for the 
Gaussian distribution of mean (9) and 
standard deviation a. If we draw iV values 
of 9 from a Gaussian distribution, 68.3% 
of the values will be inside the interval 
[(9) — a, (9) + a] as shown in panel (a), 
whereas 95.5% of the values will be inside 
the interval [(9) — 2a, (9) + 2a] as shown in 
panel (b). 



Figure 6: Different types of 68.3% confi- 
dence intervals for a multimodal likelihood 
function. The vertical dashed red line indi- 
cates the maximum-likelihood estimate 9. 
The panels are numbered according to the 
definitions in the main text. 



If we draw a sample value 9 from a Gaussian with mean (9) and standard deviation a, e.g., by 
trying to estimate (9) from measured data, the deviation \9 — (9)\ will be smaller than la with 
68.3% probability, and it will be smaller than 2a with 95.5% probability, etc. In simple words, 
if we fit some function to N data points with Gaussian errors, we have to expect that 31.7% of 
all data points deviate from this fit by more than one sigmaj^] 

The Gaussian is an almost trivial example, due to its symmetry around the mean. In 
general, likelihood functions may not be symmetric and how to define confidence intervals in 
these cases should be explained. For asymmetric distributions mean and maximum position do 
not conincide (e.g. Poisson distribution). The actual parameter estimate 9 is the maximum- 
likelihood estimate, i.e., it indicates the maximum of the likelihood function, not its mean. We 
define the confidence interval 9- < 9 < 9 + for a given distribution function prob(#) via (e.g. 



Barlow 1993) 



prob(6L 



<9<9 + ) = / d0prob(0) = C 

J8- 



(21) 



where usually C = 0.683 in analogy to the one-sigma-interval of the Gaussian. In practice, the 
distribution function prob(#) is usually unknown and only given as a histogram of samples of 



9. In this case, the integral in Eq. (21 ) reduces to the fraction of all samples 9 that are between 
9- and 9 + . However, Eq. (21) does not uniquely define the confidence interval, an additional 



criterion is required. Possible criteria are (e.g. Barlow 1993): 



1. Symmetric interval: 6L and 9 + are symmetric around the parameter estimate, i.e., 9—9- 
9,-9. 



2. Shortest interval: 9 + — is smallest for all intervals that satisfy Eq. (21). 



5 If you are presented fitted data where the fit goes through all la-errorbars, you should definitely be sceptical. 
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0.6827 0.3935 


0.1988 


0.1149 


0.0715 


0.0466 


0.0314 


0.0217 0.0152 


0.0109 



Table 2: Confidence cp contained within a la contour of a P-dimensional Gaussian as given by 
Eq. (23). In ten dimensions, the la contour contains a ridiculously small confidence of ~ 1.1%. 



3. Central interval: The probabilities above and below the interval are equal, i.e., j°_ d9 prob(#) 
/~d0prob(0) = (l-C)/2. 

In case of a symmetric distribution, e.g., a Gaussian, all three definitions are indeed equivalent. 
However, in general they lead to different confidence intervals. Figure [6] shows the 68.3% 
confidence interval^] resulting from the three definitions for an example distribution that could 
be a likelihood function resulting from a parameter estimate. In practice, there is usually no 
preference for any of these definitions 
used. 
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it should only be made explicitly clear which one is 



2.5.2 Two or more parameters 

If we are estimating two or more parameters and are interested in estimating the joint confidence 
region, things become considerably more difficult. This difficulty largely stems from the fact 
that multiple parameters will usually exhibit mutual correlations. The following discussion 



largely follows Barlow (1993). 



First, consider the case where the central-limit theorem indeed ensures that some (mul- 
tidimensional) likelihood function is approximately Gaussian at its maximum position. Such 
(multivariate) Gaussians, with P-dimensional mean vector \x and P x P covariance matrix S, 



prob(x|/7, S) 



v /(2vr) p detS 



cxp 



--(£- fiy ■ xr 1 -(x-ji) 



(22) 



provide ellipsoidal error contours, i.e., they are capable of describing linear correlations in the 
parameters, but not nonlinear correlations such as "banana-shaped" error contours. However, 
even in this simple case, things are complicated. The reason for this is that the one-sigma 
contour no longer marks a 68.3% confidence region as in Fig. [5] It is straight-forward to 
compute that the one-sigma contour of a two-dimensional Gaussian marks a 39.4% confidence 
region, whereas in three dimensions it is just a 19.9% confidence regionj^] In general, the 
confidence cp contained inside a one-sigma contour of a P-dimensional Gaussian with P > 1 is 
given by, 



c P 



(2tt) p / ; 



-2tt2 



P-2 



drr P-l e rV2 



(23) 



Table [2] gives cp for P-dimensional Gaussians with P < 10, in order to give an impression of 
how quickly the confidence declines. Evidently, one needs to be very careful when interpreting 
one-sigma contours in more than one dimension. 



9 Note our terminology: We are talking of a "68.3% confidence interval", not of a "one-sigma interval". 
10 A symmetric confidence interval may not be sensible in case of a highly asymmetric likelihood function. 
As a nice example, consider the likelihood function of example 3.1 shown in Fig. [3] Furthermore, the central 
interval would cause the "maximum" at / = to lie outside this confidence interval. 

11 In order to obtain the two-dimensional result, solve the integral dip J° dr r 2 ^ 2 exp — that assumes 
a spherically symmetric Gaussian given in polar coordinates. 
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If the central-limit theorem does not apply - e.g., because the number N of measured data 
is small or the likelihood function itself is not well-behaved - things get even more involved. 
Nonlinear correlations in the parameters, i.e., "banana-shaped" error contours, are an obvious 
indicator for this case. The symmetric confidence region can still be defined easily, but it 
obviously lacks the ability to describe parameter correlations. Identifying the "shortest region" 
or "central region" may be computationally very expensive. Barlow (1993) recommends a 
definition of the confidence region via the contour at which the logarithmic likelihood is 0.5 
lower than that at its maximum, i.e., where the likelihood function takes 1/e of its maximum 
value. However, the degree of confidence of the resulting region strongly depends on the number 
of dimensions, similarly to the Gaussian case discussed above. 

As an alternative approach I now discuss a 
method that is designed to provide a 68.3% con- 
fidence region. Similarly to the method recom- 
mended by Barlow (1993), it employs contours 



55. 



2_ 



(a) 












(b) _ /\ 















on which the likelihood function is constant. The 
recipe is as follows: 

1. Locate the maximum of the likeli- 
hood function, £ max , i-e., perform the 
maximum-likelihood estimation of the pa- 
rameters. 

2. Identify the contours where the likelihood 
function takes some constant value Cq < 
£ max and integrate £ over these regions in 
order to get their confidence level. 

3. Adjust the contour level C such that the 
resulting region in parameter space has 
68.3% confidence. 

We visualise this method in Fig. [7] where, for 
the sake of visualisation, there is only a sin- 
gle parameter. Moreover, Fig. [7] also demon- 
strates that this method may provide a 68.3% 
confidence region that consists of various dis- 
connected regions, if the likelihood function is 
multimodal. Actually, this is not a real problem 

because the disconnected regions then also indicate the other local maxima. If the likelihood 
function is highly multimodal as in Fig. [7] and also in Fig. |6j this clearly makes more sense 
than using confidence intervals that are symmetric, central or shortest. 



Figure 7: Confidence regions for an example 
likelihood function resulting from Barlow's 
method (a) and the alternative method pre- 
sented in this manuscript (b). The horizon- 
tal dashed red lines indicate the slices at Co. 
In panel (a) Barlow's region has 80.4% con- 
fidence. In panel (b) the 68.3% confidence 
region consists of two regions that are not 
connected. 



2.5.3 The worst-case scenario 

A final word of caution, before specific methods are discussed: As discussed earlier, parameter 
estimation via maximising a likelihood function and quantifying its uncertainties via certain 
intervals is just a makeshift. If it is possible, it greatly simplifies the interpretation because 
it describes the likelihood function by a simple set of numbers. However, in practice, this 
procedure may turn out to be infeasible. In such cases, we have to resort to working with the 
full likelihood function itself, without further simplifications. 
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3 Error estimation for model-based parameters 

I now discuss how to perform an error estimation in the case of model-based parameter esti- 
mates. In particular, I discuss four different methods - brute-force grids, varying x 2 , the Fisher 
matrix, and Monte-Carlo methods - which are very common in astronomy. 



3.1 Brute- force grids 



A very straight-forward approach to both pa- 
rameter estimation and subsequent error esti- 
mation are brute-force grids. In this case, we 
span a (rectangular) grid in the parameter space 
and evaluate the likelihood function at every grid 
point. If we have only one or two model param- 
eters, we can then plot the likelihood function 
directly and we can also directly infer the error 
estimates from the contour lines. 

This method is very robust, because it only 
relies on the assumption that the error distribu- 
tion of the measured data is correct, i.e., that we 
are certain about the likelihood function. How- 
ever, if the number of model parameters is large, 
this method becomes computationally infeasible. 
Computational infeasibility may even be an issue 
for only two or even a single model parameter, 
if it is very expensive to evaluate the likelihood 
function. 

3.1.1 Application to example data 
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Figure 8: Error estimation via brute-force 
grids for the standard data. Panel (a) as- 
sumes the error distribution to be Poisso- 
nian, whereas panel (b) assumes it to be 
Gaussian. Both likelihood functions C(fi) 
(solid blue curves) peak at n ~ 10.07. The 
dashed red curves are matched Gaussians 
that are slightly shifted upwards for the sake 
of visibility. The Gaussian's standard devia- 
tions (which provide the error estimates) are 



bp ~ 0.59 for panel (a) and ~ 0.59 for 
panel (b). 



This method is applied to the example data of 
Table [T] and used to try to estimate the mean and 
its error. For candidate values of \i G [1, 19] the 
likelihood function £(/i) is computed assuming 
the data's error distribution to be Poisson and 
Gaussian. Figure [8] shows £(/i) for both cases. 
The function £(/x) is then matched by a Gaus- 

siarj^ and the standard deviation of this Gaussian is the error estimate. This provides the 
estimate ~ 0.59 for both Poisson and Gaussian error distributions. Concerning example 3, 
Figure [3] is essentially a brute-force grid parameter estimation. 



3.2 Varying x 2 

X 2 has already been introduced in Eq. Let us assume that the model parameters were 

estimated by minimising x 2 - We then vary the model parameters slightly around the optimal 
values such that x 2 changes by less than 1. In other words, we look for the contour where 
X 2 = Xmm + 1> thereby defining the error estimate of the model parameters. The basic idea 
here is that if the likelihood function were Gaussian, this would yield the la contour. 

12 This is motivated by the central-limit theorem, but the Gaussian approximation needs to be checked. 
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The crucial assumption of this method is that the error distribution of the measured data 
is indeed Gaussian, because otherwise using a \ 2 does not make sense (Sect. 2.3.1). Moreover, 
this method relies on an accurate measurement of the data errors a r , . 



3.2.1 An example of bad practice 

In this section, I want to discuss an example of bad practise that I discovered as a refereed 
publication, which provides a couple of lessons to learn. For obvious reasons, I will not give a 
reference here. The publication pushes the method of varying \ 2 even further in an attempt to 
ensure that the measurement errors a n are correct. Fitting a model with P parameters to N 
data points, the authors demand that 



X 



N-P <=> 



Xrcd 



X 



N-P 



1 



(24) 



where x 2 e d * s called "reduced x 2 "- Loosely speaking, this means that each degree of freedom 
contributes one standard deviation. The authors then try to correct for potentially wrong 
values of the a n by rescaling them such that xLi = 1- 

There are several objections to this method. 
In order to see this, we need to consider the as- 
sumptions involved: 

1. The error distribution has to be Gaussian, 
as in varying \ 2 ■ The authors do not jus- 
tify this assumption. 

2. The model has to be linear in all P fit 
parameters. If the model is nonlinear, we 
cannot demand that x 2 e d = 1> because the 
derivation of \ 2 = N — P implicitly as- 
sumes linearity in all parameters (e.g. cf. 
Barlow||1993[ )p1 A gain, the authors do not 
comment on this, in fact, they even never 
explicitly say what their model is. 




Figure 9: Error estimation via varying \ 2 f° r 
the standard data. The minimum of x 2 oc_ 



10.09 as suggested by Eq. (10). 



3. By demanding xted = 1, we explicitly claim 
that the model we are using is the correct 
model that was underlying the data. This 
is a rather optimistic claim. Of course, we 
would like our model to be the truth, but 
this claim requires justification. Moreover, 
in many practical situations we are actu- 
ally not interested in the truth, but rather 
make use of a model which expresses a rea- 
sonable simplification. 

Even if all these assumptions are met, the 

method is in fact only applicable if the degrees of freedom N — P are large. The reason is 
that the uncertainty in the measured data does not only cause an uncertainty in the model 



curs at fit, 

There \ 2 ~ 12.4, as indicated by the lower 
horizontal, dashed, red line. The upper hor- 
izontal, dashed, red line indicates the value 
12.4 + 1.0 = 13.4. The error estimates are 
defined by those points where x 2 — 13.4, as 
indicated by the two vertical solid black lines. 
The resulting error estimate is ~ 0.57. 



13 More precisely: We need to know the number of degrees of freedom in order to evaluate X? d- As we 
are going to show in an upcoming publication, estimating the number of degrees of freedom is possible but 
nontrivial for linear models. For nonlinear models, however, estimating the number of degrees of freedom is 
virtually impossible, unless the central-limit theorem provides a good approximation. 
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parameters, but also an uncertainty in the value of x 2 
the so-called ^-distribution (e.g 



itself. The value of x 2 is subject to 



sec 



Barlow 1993) whose expectation value is indeed N — P. 



However, this distribution is not sharp but has a nonzero variance of 2(iV — P). Consequently, 
if iV — P is small, there is a large relative uncertainty on the value of x 2 ■ This means x 2 
deviate substantially from N — P even though the model is linear and correct. 



may 



3.2.2 Application to example data 



In Figure |9j this method is used to estimate the error on the mean estimator given by Eq. (10). 
Given the data of Table [T] and /t ~ 10.09, the minimal value of \ 2 is ~ 12.4 The next step is 
to look for those values of /i where x 2 takes the values 12.4 + 1.0 = 13.4. In this simple case, 
these two points are symmetric around the minimum because the model is linear. The resulting 

0.57 as in Sect. 



error estimate IS (Tii ~ 



3.1.1 



For more general models, the error estimates can 
be asymmetric. This method is not applicable to example 3, because the underlying error 
distribution is not Gaussian. 



3.3 Fisher matrix 



Error estimation via the Fisher matrix is a very popular approach (e.g. see Heavens 2009). 
Let us assume we have fitted our model parameters to our data via maximising the logarith- 
mic likelihood (e.g. minimising x 2 i n cas e of Gaussian noise). This method is based on the 
central-limit theorem which tells us that any well-behaved likelihood function is asymptotically 
Gaussian near its maximum. We may therefore write 



C(9) cx exp 



1 

2 1 



0o) 



(25) 



which is a P-dimensional (P-variate) Gaussian with mean 9q and covariance matrix S. This 
covariance matrix is the desired error estimate. On the diagonals it contains the variance 
estimates of each individual 9 P and the off-diagonals are the estimates of the covariances (see 



e.g. Barlow 1993, for more information about covariances). Comparing Eqs. (20) and (25), we 



identify 



d 2 log £ 



-i 



(26) 



Care should be taken with the order of indices and matrix inversion. The matrix of second 
derivatives of log £ is called "Fisher matrix" . If the second derivatives of log C can be evaluated 
analytically, this method may be extremely fast from a computational point of view. However, 
if this is impossible, they can usually also be evaluated numerically. By construction, this 
method can only describe elliptical error contours. It is impossible to obtain "banana-shaped" 
error contours from this method. 

Of course, this method also invokes assumptions that have to be checked. Those assumptions 

are: 



1. The error distribution of the measurements is known, i.e., the likelihood function is defined 
correctly. 



2. The second-order Taylor expansion of Eq. (20) is a good approximation. 



14 Note that this deviates from N — 1 = 29 due to uncertainty in x 2 induced by the small number of data 
points. 
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This second assumption is the actual problem. Although the central-limit theorem ensures this 
asymptotically, Fig. [4] shows an example where this assumption of the Fisher matrix breaks 
down. There are two simple tests to check the validity of the resulting covariance-matrix 
candidate S. A valid covariance matrix has to be positive definite, i.e., x T • S • x > for any 
nonzero vector x, and both tests try to check this: 

1. Compute the determinant det S. If det S < 0, S is not valid. 

2. Diagonalise the matrix S in order to determine its eigenvalues. If any eigenvalue is 
negative or zero, S is not valid. 

The first test is usually easier to perform, whereas the second test is more restrictive. It is 
strongly recommended that these tests are applied whenever this method is used. Unfortu- 
nately, these tests are only rule-out criteria. If S fails any of these tests, it is clearly ruled 
out. However, if it passes both tests, we still cannot be sure that £ is a good approximation, 
i.e., that the Gaussian is indeed a decent approximation to the likelihood function at its maxi- 
mum. Nevertheless, the major advantage of this method is that it is very fast and efficient, in 
particular if we can evaluate the second derivatives of log C analytically. 

There are also situations where the Fisher matrix is definitely correct. This is the case for 
Gaussian measurement errors and linear models. In this case C(6) is truly a Gaussian even 
without any approximation. For instances, inspect Eq. Q: x 2 is a quadratic function of fi, 
and since C(/a) oc e~ x I 2 the likelihood function is a Gaussian. 



3.3.1 Example 1 revisited 



In order to see the Fisher matrix "in action", we now return to example 1 from Sect. 2.3.1 
The task was to estimate the mean /i of N data points {a^i, X2, ■ ■ ■ , x^} that are drawn from a 
Gaussian error distribution. The maximum-likelihood estimator for \x is given by Eq. @. The 
task now is to use the Fisher matrix in order to derive the error estimate for Eq. ([7]). Equation 
([9]) is derived once more with respect to /1, 



d 2 log C{D;li) 



da 2 ' cr 

n=l 

For the error estimate of fx this then yields 



- 1 



N - 



If all iV data points are again assumed to have identical errors, o~ n = a, this simplifies to 
a 2 , = o~ 2 /N which is the error estimate for the arithmetic mean in case of Gaussian measurement 
errors. 



3.3.2 Example 2 revisited 

The same exercise is repeated for example 2, where the error distribution is assumed to be 
Poissonian. The second derivative of Eq. ( 14 ) with respect to 11 reads 

d 2 log £(D;li) 



dy? 



1 N 

n=l 



(29) 
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For the error estimate of ft this yields 

-2 



N 



where J2n=i x n = Nfi has been identified according to the estimator of Eq. (15) 



(30) 



3.3.3 Application to example data 



Inserting the example data of Table [T] into Eq. (28), an error estimate of ~ 0.57 is obtained, 
which is in agreement with the error estimates from Sections 3. 1. 1| and |3.2.2 In the case of 
a Poisson error distribution, insertion of the example data of Table [l] into Eq. (30) yields 
<5~ M « 0.58, which is also in agreement with the result of Sect. 3.1.1 Both results are reliable, 
since in both cases the likelihood function is well approximated by a Gaussian, as we have seen 
in Fig. M 



3.3.4 Example 3 revisited 



Let us compute the second derivative of the logarithmic likelihood given by Eq. (17) w.r.t. /. 
We obtain, 



d 2 log£ 



n 
P 



N -n 



N 3 



n(N — n) 



where we have inserted the maximum-likelihood estimator / 
pretends to be 



(31) 



n/N . Hence, the error estimate 



n(N — n) 
A3 



which yields aj = in example 3.1 and aj ~ 0.004 in example 3.2. Now we are lucky, because 
cr| = tells us that we forgot that in example 3 the likelihood function is binomial and not 
Gaussian, i.e., this whole calculation was nonsense. Unfortunately, this is not necessarily that 
obvious, as the result for example 3.2 shows. 



3.4 Monte-Carlo methods 

Monte-Carlo methods directly draw samples from the likelihood function, i.e., they guess values 
of the fit parameters and accept them with the probability defined by the corresponding value 
of the likelihood function. Although these methods may appear difficult at first glance, Monte- 
Carlo sampling is actually the most intuitive approach to error estimation. The reason is its 
similarity to measuring errors of data by repeating the measurement process and monitoring 
the results. The strength of all Monte-Carlo methods is that they use a minimum amount of 
assumptions. Their sole and only assumption is that the error distribution of the measured 
data is known correctly, i.e., that we are certain about the likelihood function. There are no 
further requirements such as Gaussianity, which renders this approach very general. 

There are many different types of Monte-Carlo methods for many different situations. It 
would be beyond the scope of this manuscript to explain these methods. Instead, I explain 
which methods are useful in which kinds of situations and refer the interested reader to the 



literature. MacKay (2003), for example, provides an excellent introduction to all the methods 
named here. 
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The first criterion in choosing a particular 
Monte-Carlo algorithm is the number of model 
parameters. If the number of parameters is small 
- say 1 or 2 - we can use one of the following 
Monte-Carlo algorithms: uniform sampling, im- 
portance sampling, or rejection sampling. How- 
ever, if the number of parameters is large, these 
algorithms quickly become inefficient, i.e., com- 
putationally expensive. In the case of many 
model parameters, there is the family of Markov- 
chain Monte-Carlo (MCMC) algorithms, whose 
computation times scale linearly with the num- 
ber of model parameters. The most famous (and 
most simple) type of MCMC algorithm is the 
Metropolis-Hastings algorithm. However, this 
algorithm involves a stepsize for each parame- 
ter that needs to be fine tuned by hand. This is 
a real problem in the case of many parameters. 
We therefore recommend the slice-sampling al- 
gorithm, which also involves a stepsize but it is 




Figure 10: Error estimation via Monte-Carlo 
sampling for the standard data. The distri- 
butions of \x resulting from the MCMC algo- 
rithm assuming Poisson errors in panel (a) 
and Gaussian errors in panel (b) are both 
well approximated by a Gaussian. In panel 
(a) the Gaussian is given by ft = 10.10 and 
<r M = 0.58. In panel (b) the Gaussian is given 



tuned automatically without human interaction. 
In fact, there is also Gibbs sampling which does by fx = 10.10 and <5" M = 0.57. 
not involve any stepsizes at all. However, Gibbs 

sampling can only be used directly for trivial models. In order to employ Gibbs sampling in a 
wider context, one has to combine it with other one-dimensional sampling algorithms. 

In practice, we start an MCMC algorithm at the estimated maximum 8q of the likelihood 
function and let it draw M samples of model parameters {8\, ... ,8m} from the likelihood 
function 
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All we then need to do is to analyse the distribution of this set of parameters F] 
For instance, assuming Gaussianity, we can do this via estimating the mean and the covariance 
of this sample. However, the distribution can also be analysed more generally such that it is 
possible to describe nonlinear dependencies between different model parameters, e.g., "banana 
shaped" error contours. 



3.4.1 Application to example data 

Once again, this method is applied to the example data given in Table [T] A Metropolis- 
Hastings algorithm is used with stepsizes inspired by the error estimates obtained previously. 
For likelihood functions assuming Poissonian and Gaussian error distributions, respectively, the 
MCMC algorithm is iterated 100,000 times. Afterwards, the resulting sequences are thinned 
by picking out every tenth data point and discarding everything else. From the remaining 



10,000 data points the arithmetic mean and its standard deviation are estimated. In Fig. 10 



the resulting data distribution is overplotted by a corresponding Gaussian, which provides the 
error estimate. For the Poisson error distribution <t m = 0.58 is obtained, and <3" M = 0.57 for 
Gaussian error distribution. This is again in agreement with previous results. 



15 Initialising the MCMC at the optimum 9q also has the advantage that we do not need to use any convergence 
diagnostics, since we are already at the optimum. Convergence diagnostics for MCMC algorithms are still an 



open issue ( |Cowles fc C"arirn|[l996l ). 

16 In practice we often need to "thin" this sequence due to autocorrelations in the Markov chain, e.g., by 
selecting every tenth point and discarding all others. 
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4 Error estimation for model-independent parameters 

So far we have focused our attention on error estimation for model-based parameters. Model- 
based inference usually involves an optimisation problem which can complicate matters. There- 
fore, model-independent parameters that usually do not involve any optimisation are very 
popular in astronomy. I now explain how to estimate errors for parameters of this type. 

4.1 Error propagation 

One simple method is error propagation. If we can express the parameter as a function of the 
measured data and if the error distribution of the data is Gaussian, we can employ Gaussian 
error propagation. 

To give an example, let us consider the flux F of a galaxy. If the image background has 
been subtracted correctly, the flux is just given by the sum of all N pixel values ft in the image, 



N 

£ = ( 32 ) 
i=i 

As argued in Sect. 2.2 the error distribution in photometric images is in excellent approxima- 
tion to Gaussian, if the exposure time was long enough. If we denote the measurement error 
of pixel i by <7j, we can then estimate the error of F via 




^=>:k«5v (33) 



which is fairly simple in this case. However, this can become very cumbersome for more general 
model-independent parameters. In particular, it is impossible if a model-independent parameter 
involves an operation on the measured data that is not differentiable, e.g., selecting certain data 
points. In fact, error propagation can also be applied to model-based parameter estimates, if 
these estimators can be expressed as differentiable functions of the data, e.g., as is the case 



for linear models with Gaussian measurement errors. For instance, Equation (28) can also be 



derived from Eq. (10) using this method, giving the same result for the example data of Table 



[T] However, in general, this is not the case. 
4.2 Resampling the data 

A very elegant method for error estimation is to resample the measured data (for an example 



see e.g. Burtscher et al. 2009). Again, the assumption is that we know the correct error distri- 
bution of the measured data. For the sake of simplicity, let us assume this error distribution 
is Gaussian. For each data point x n we invoke a Gaussian with mean x n and standard devi- 
ation <j n as measured during the experiment. We can now sample a new data point x' n from 
this distribution]^] Doing this for all pixels, we get an "alternative" noise realisation of the 
measurement, e.g., a new image that can be interpreted as an alternative measurement result. 
We then estimate our model-independent parameter for this resampled data and the result will 
differ slightly from that obtained from the actual data. Repeating this resampling process, 
e.g., 100 times, and monitoring the resulting parameter estimates, we get a distribution of this 
model-independent parameter from which we can then infer an upper limit for the uncertainty. 
For instance, if the resulting parameter distribution is approximately Gaussian, the upper limit 
will be given by the Gaussian's standard deviation. 

17 In fact, this method is a Monte-Carlo method, too. 
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Figure 11: Error estimation via resampling 
the data, using the standard data. The 
distributions of fi resulting from the re- 
sampling procedure assuming Poisson er- 
rors in panel (a) and Gaussian errors in 
panel (b) are both well approximated by 
a Gaussian. In panel (a) the Gaussian is 
given by fi = 10.05 and a M = 0.57. In panel 
(b) the Gaussian is given by ft = 10.11 and 



0.55. 



Figure 12: Error estimation via bootstrap- 
ping the data, using the standard data. 
The distributions of /i result from the boot- 
strapping procedure and estimating the 
Poisson mean via Eq. (15) in panel (a) 



and estimating the Gaussian mean via Eq. 
(10) in panel (b). Both distributions are 
well approximated by a Gaussian. In panel 
(a) the Gaussian is given by ft = 10.06 and 
o~ /t = 0.36. In panel (b) the Gaussian is 
given by ft = 10.08 and cr M = 0.38. 



Why does this method only provide an upper limit for the uncertainty? The reason is 
that even though we are using the correct error distribution of the measured data, we are 
centering this error distribution at the measured value instead of the (unknown) true value. 
This introduces additional scatter and leads us to overestimate the uncertainty. Nevertheless, 
it may still be acceptable to use the overestimated uncertainty as a conservative estimate, 
depending on the precise scientific question. 

This method is a very intuitive approach to error estimation, because it simulates repeated 
measurements of the data. It can also be applied to error estimation for model-based parameter 
estimates. 



4.2.1 Application to example data 

Again, this method is applied to the example data of Table [T] The measured data points 
are assumed to be the means of the error distribution (Poisson and Gaussian) and each data 
point is resampled from its error distribution. For this resampled data, the Poisson mean is 
estimated via Eq. (15) and the Gaussian mean via Eq. (10). This is repeated 1,000 times and 
the estimated means are monitored. Figure 11 shows the resulting distribution of mean values. 
The error estimates agree well with those obtained from the other methods. 



4.3 Bootstrapping 

Bootstrapping ( Efron|1979[ Hastie et al. 2009) is another resampling method for error estimation 
that can be applied to model-based as well as model-independent parameter estimation. Let 
us assume we have N measurements {x\, X2, ■ ■ ■ , Xjy} from which we estimate some parameter. 
Again, we resample our data in order to create "alternative" data sets from which we then 
repeatedly estimate the parameter of interest, monitoring its distribution as before. However, 
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the details of the resampling process are different from those in Sect. |4.2| Instead of resampling 
each data point from its individual error distribution, we draw new samples from the measured 
data set itself. Drawing these "bootstrap" samples is done with replacement, i.e., the same 
data point can occur multiple times in our bootstrap sample. To give an example, we consider 
a data set {x±, x 2 , x$, X4}. Some examples for its bootstrap samples are: 

• {xi, x 2 , x 3 , £4} itself, of course. 

• {x 1 ,X 2l X li X4}, 

• {xi,x 2 ,x 2 ,x 4 }, 

• {Xi,X 3 ,X 3 ,X 3 }, 

• {x 2 ,x 2 ,x 2 ,x 2 }, which is bad but possible. 

As the same data point can occur multiple times but ordering is not important, for iV data 
points the total number of possible different bootstrap samples is 

27V -1 ^ _ (27V -1)! 



N ) JVI(JV-l)!' 

In practice, the number of bootstrap samples chosen is set to some useful number, where 
"useful" is determined by the trade-off between computational effort and the desire to have as 
many samples as possible in order to get a good estimate. 

The major advantage of bootstrapping is that the error distribution of the measured data 
does not need to be known (unless the parameter estimation itself requires it). The crucial 
assumption here is that the measured data sample itself encodes the information about its 
error distribution. However, the parameter estimation must be capable of dealing with the 
bootstrapped samples which, in general, include certain data points multiple times while com- 



pletely lacking other data points. For instance, the flux estimator of Eq. (32) would not be 
capable of handling bootstrap samples, since all pixels have to contribute precisely once. Nev- 
ertheless, if we know the data's error distribution, we should really exploit this knowledge by 
using, e.g., resampling instead of bootstrapping. 



4.3.1 Application to example data 

Finally, also bootstrapping is applied to the data of Table [T] This data sample contains N = 30 
values, i.e., the total number of possible bootstrap samples is ~ 5.9- 10 16 . Then, 10,000 bootstrap 
samples are drawn and for every sample the Poisson mean is estimated via Eq. (15) and the 
Gaussian mean via Eq. (10). Figure 12 shows the resulting distributions. Obviously, the mean 
values are estimated correctly. However, the errors are underestimated, especially in the case 
of the Gaussian. The likely explanation is that the data sample is not large enough in order to 
contain sufficient information about its underlying error distribution. 



5 Propagating measurement errors through data-reduction 
pipelines 

Usually, directly measured data has to be preprocessed before it can be analysed further. The 
preprocessing is typically done by some data-reduction pipeline. For instances, the data coming 
out of a spectrograph is in its raw state rather useless. Before one can analyse it, one has to 
subtract the bias, correct for the flat field response of the CCD, and calibrate wavelength and 
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flux. This preprocessing is usually done either using complete instrument-specific pipelines, 
more general software routines such as those found in IRAF or MIDAS, or a combination of 
both. Now the question arises, how to propagate the errors on the initial measurements through 
such complex preprocessing? 

In general, an analytic error propagation such as that discussed in Sect. 4^ is impossible, 
either because the data-reduction pipeline is too complex, or because the pipeline is used as 
a "black box". Nevertheless, it is possible to propagate the errors through the pipeline via 



resampling (Sect. 4.2), though it may be computationally expensive. Let us outline this 



method for our previous example of spectral measurements. We do have the raw spectral data, 
the measured bias fields and flat fields. We resample each of these fields as described in Sect. 



4.2, say iV resamplings, assuming the measurement errors are Gaussian (or Poisson, if we count 



only few photons). Then, we feed each resampling instance through the pipeline and monitor 
the outcome. The result will be a set of reduced spectra, which provide an estimate of the 
reduced spectrum's error distribution. 



Of course, this method may be computationally expensive in practice,^ However, as we 
have argued earlier, an error estimate is inevitably necessary. Therefore, if this method is the 
only possibility to get such an error estimate, computational cost is not an argument, 



1!) 



6 Summary 

I have discussed different methods for error estimation that apply to model-based as well as 
model-independent parameter estimates. The methods have been briefly outlined and their 
assumptions have been made explicit. Whenever employing one of these methods, all assump- 
tions should be checked. Table [3] summarises all the methods discussed here and provides a 
brief overview of their applicability. It was beyond the scope of this manuscript to describe 
all methods in detail. Where possible I pointed to the literature for examples. Furthermore, I 
have also outlined how one can propagate errors through data-reduction pipelines. 

My recommendations for error estimation are to use Monte-Carlo methods in case of model- 
based parameter estimates and Monte- Carlo-like resampling of the measured data in case of 
model-independent parameter estimates. These methods only require knowledge of the mea- 
surement errors but do not invoke further assumptions such as Gaussianity of the likelihood 
function near its maximum. Bootstrapping may also be an option if sufficient data are available. 

I conclude with some recommendations for further reading: 



Barlow (1993): An easy-to-read introduction into the basics of statistics without going 
too much into depth. Recommendable to get a first idea about parameter estimation, 
error estimation, and the interpretation of uncertainties. 



Press et al. (2002): This book contains some very useful chapters about statistical theory, 
e.g., linear least-squares fitting. It is excellent for looking up how a certain method works, 
but it is not meant as an introduction to statistics. 



Hastie et al. (2009): A good textbook giving a profound introduction into analysing data. 
However, the main focus of this book is on classification problems, rather than regression 
problems. Given the rather mathematical notation and compactness, this book requires 
a certain level of prior knowledge. 



18 Although after the necessary groundwork has been covered, the batch processing of simple spectra usually 
takes no more than a few seconds a piece. 

19 In fact, one may argue that computational cost is not an argument anyway. If something is computationally 
too expensive, then one should buy more computers! Unfortunately, this approach is usually hampered by the 
fact that licensed software is very popular in astronomy (e.g. IDL). 
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method section model-based 


model-independent 


data error 


contours 


brute-force grids 


3.1 


yes 


no 


known 


arbitrary 


varying \ 2 


3.2 


yes 


no 


Gaussian 


arbitrary 


Fisher matrix 


3.3 


yes 


no 


known 


elliptical 


Monte-Carlo methods 


3.4 


yes 


no 


known 


arbitrary 


error propagation 


4.1 


depends 


yes 


Gaussian 


elliptical 


resampling the data 


4.2 


yes 


yes 


known 


arbitrary 


bootstrapping 


4.3 


yes 


yes 


unknown 


arbitrary 



Table 3: Methods for error estimation discussed in this manuscript. This table gives a brief 
overview of each method: Specifically, whether a certain method is applicable to model-based 
and/or model-independent parameter estimates, whether knowledge about the data's error 
distribution is necessary, and what kind of error contours can be estimated. 



MacKay (2003): The focus of this textbook is again mainly on classification, but it 
also provides a broader overview of concepts of data analysis and contains an excellent 
introduction to Monte-Carlo methods. This textbook also requires a certain level of prior 
knowledge. 
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